INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans, scheduling conflicts, etc. Canceling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts.

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You, as a data scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Let's drop the Booking_ID column.

  1. There are no missing values in the dataset.
  2. type_of_meal_plan, room_type_reserved, market_segment_type and booking_status are of object datatype.
  3. The other 14 variables are numerical, so their Python data types (int64 and float64) are appropriate.

There are no missing values in the dataset.
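The checks above can be sketched as follows. This is a minimal illustration on a toy DataFrame; the name `data` and all values are assumptions made for the example, not the actual bookings data.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are illustrative only).
data = pd.DataFrame(
    {
        "Booking_ID": ["ID_1", "ID_2", "ID_3"],
        "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 1"],
        "lead_time": [224, 5, 1],
        "avg_price_per_room": [65.0, 106.68, 60.0],
        "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled"],
    }
)

# Booking_ID is assumed to be a pure identifier, so it is dropped before analysis.
data = data.drop(columns=["Booking_ID"])

# Count missing values per column and list the non-numeric (text) columns.
missing_per_column = data.isnull().sum()
object_columns = data.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(object_columns)
```

On the real data, `data.info()` would give the same dtype and non-null overview in one call.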

  1. There are 4 types of meal plans booked by customers.
  2. The average price per room ranges from 0 to 540 euros. The zero values, which indicate free rooms, could be outliers. The mean average price per room is 103.4 euros. The mean is greater than the median, indicating a right-skewed distribution.
  3. The lead_time, i.e., the number of days between the booking date and the arrival date, ranges from 0 to 443 days. The maximum is more than one year. The mean is greater than the median, indicating a right-skewed distribution.
  4. The arrival month is spread throughout the year and is left-skewed. The average arrival month is July.
  5. The arrival date is spread throughout the month.
  6. The mean of repeated_guest is nearly zero, so very few bookings come from repeated guests.
  7. The number of previous cancellations prior to the current booking ranges from 0 to 13.
  8. The no_of_previous_bookings_not_canceled prior to the current booking ranges from 0 to 58.
  9. The number of nights stayed spans a larger range on weekdays (0 to 17) than on weekends (0 to 7).
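The "mean greater than median implies right skew" reasoning used above can be checked numerically. The sketch below uses made-up lead times, not the real column, and compares the mean, the median and the sample skewness.

```python
import pandas as pd

# Illustrative lead times in days (not the real data): a few long waits pull the mean up.
lead_time = pd.Series([0, 3, 10, 20, 45, 60, 90, 200, 443], name="lead_time")

summary = lead_time.describe()   # count, mean, std, min, quartiles, max
mean_value = lead_time.mean()
median_value = lead_time.median()
skewness = lead_time.skew()      # positive for a right-skewed sample

print(summary)
print(f"mean={mean_value:.1f}, median={median_value:.1f}, skew={skewness:.2f}")
```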

Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

number of adults

About 72% of bookings are for 2 adults, followed by 21.2% for 1 adult, and so on. The maximum number of adults accounts for the lowest share, roughly 0.0% of bookings. The 0.4% of bookings with no adults could be outliers.
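Percentages like these are category shares. One way to compute them, sketched on an illustrative column rather than the real data, is `value_counts(normalize=True)`:

```python
import pandas as pd

# Illustrative column (not the real data).
no_of_adults = pd.Series([2, 2, 2, 1, 2, 3, 1, 2, 0, 2], name="no_of_adults")

# Share of bookings per category, as a percentage rounded to one decimal.
share_pct = (no_of_adults.value_counts(normalize=True) * 100).round(1)
print(share_pct)
```

The same call applies to any of the categorical or count columns discussed in this section.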

number of children

92.6% of bookings have no children. Bookings with 9 or 10 children could be outliers.

lead time

The lead_time distribution is right-skewed and there are outliers in the distribution.

average price per room

The average price is distributed around 100 euros and is slightly right-skewed. There are outliers in the distribution.
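The text does not say how outliers were identified. A common convention, assumed here, is the boxplot rule that flags points more than 1.5 times the interquartile range beyond the quartiles. A sketch on illustrative prices:

```python
import pandas as pd

# Illustrative prices in euros (not the real data).
avg_price = pd.Series([0.0, 65.0, 80.0, 95.0, 100.0, 105.0, 120.0, 135.0, 540.0])

q1, q3 = avg_price.quantile(0.25), avg_price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are the points a boxplot would draw as outliers.
outliers = avg_price[(avg_price < lower) | (avg_price > upper)]
print(outliers.tolist())
```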

number of weekend nights

The percentage of bookings decreases as the number of weekend nights (booked or stayed) increases.

number of week nights

Customers mostly book 1 or 2 week nights, with the top category accounting for 31.5% of bookings. The percentage of bookings decreases as the number of nights increases beyond 2.

required car parking space

96.9% of customers do not require a car parking space. Only 3.1% of customers require one.

no of previous cancellations

There are outliers in the distribution.

no_of_previous_bookings not canceled

There are outliers in the distribution.

no_of_special_requests

Bookings with no special requests have the highest percentage, followed by bookings with one special request.

arrival month

Arrivals peak in October and are lowest in January.

arrival_date

The arrival date is distributed nearly uniformly throughout the month, except for the 31st.

type_of_meal_plan

There are 4 types of meal plans. Customers mostly prefer meal plan 1. The second most common choice is no plan (not_selected). Meal plan 3 is rarely chosen.

room_type_reserved

There are 7 room types available for reservation. Room type 1 is the most preferred, followed by room type 4. Room type 3 is the least preferred.

market_segment_type

The Online market segment has the highest percentage of bookings. Aviation has the lowest.

booking_status

32.8% of bookings were canceled and 67.2% were not canceled.

Bivariate Analysis

Observations:

  1. There are non-zero correlations between booking_status and other factors (lead_time, repeated_guest, no_of_special_requests, avg_price_per_room).
  2. The avg_price_per_room increases with no_of_children and no_of_adults.
  3. The no_of_previous_bookings_not_canceled and repeated_guest are highly correlated.
  4. The no_of_previous_cancellations and repeated_guest are correlated.
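Correlations with booking_status require a numeric encoding of the target. The sketch below assumes the labels "Canceled" and "Not_Canceled" and uses a hypothetical helper column `canceled`. The rows are illustrative, not the real data.

```python
import pandas as pd

# Illustrative rows (not the real data).
df = pd.DataFrame(
    {
        "repeated_guest": [0, 0, 1, 0, 1, 0, 0, 0],
        "no_of_special_requests": [0, 0, 2, 1, 1, 0, 3, 0],
        "booking_status": ["Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
                           "Not_Canceled", "Canceled", "Not_Canceled", "Canceled"],
    }
)

# Encode the target as 1 = canceled, 0 = not canceled so it can enter the matrix.
df["canceled"] = (df["booking_status"] == "Canceled").astype(int)

# Pairwise Pearson correlations between the numeric columns.
corr = df[["repeated_guest", "no_of_special_requests", "canceled"]].corr()
print(corr.round(2))
```

A heatmap of `corr` is the usual way to read such a matrix at a glance.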

avg_price_per_room vs. market_segment_type

The average price per room is highest for the Online segment and lowest for the Complementary segment. There are outliers in every segment except Aviation.

market_segment_type vs booking_status

Cancellations are higher in the Online segment than in the others. There are no cancellations in the Complementary segment.
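A per-segment cancellation rate can be read from a row-normalized cross-tabulation. The rows below are illustrative, not the real data.

```python
import pandas as pd

# Illustrative rows (not the real data).
df = pd.DataFrame(
    {
        "market_segment_type": ["Online", "Online", "Online", "Online",
                                "Offline", "Offline", "Offline",
                                "Complementary", "Corporate"],
        "booking_status": ["Canceled", "Canceled", "Canceled", "Not_Canceled",
                           "Canceled", "Not_Canceled", "Not_Canceled",
                           "Not_Canceled", "Not_Canceled"],
    }
)

# Row-normalized crosstab: each row sums to 1, so the "Canceled" column is the
# cancellation rate within that market segment.
rates = pd.crosstab(df["market_segment_type"], df["booking_status"], normalize="index")
print(rates.round(2))
```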

booking_status vs no_of_weekend_nights

booking_status vs no_of_special_requests

As the number of special requests increases, booking cancellations decrease. This may imply that higher customer satisfaction leads to fewer cancellations.

avg_price_per_room vs booking_status

The number of cancellations increases until the average price reaches 110 euros and decreases beyond that.

avg_price_per_room vs no_of_special_requests

The average price per room varies with the number of special requests. For bookings with 4 special requests, the mean price is high.

lead_time vs booking_status

The number of cancellations decreases with increasing lead time.

family vs booking status

Cancellations are highest for families of 2 members. Cancellations decrease as the number of family members increases.

day spend vs booking status

Cancellations increase as the total number of days increases.
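The "family" and "day spend" comparisons rely on combined quantities. One plausible construction is sketched below. The derived column names `no_of_family_members` and `total_days`, and the source column name `no_of_week_nights`, are assumptions made for the example.

```python
import pandas as pd

# Illustrative rows (not the real data).
df = pd.DataFrame(
    {
        "no_of_adults": [2, 1, 2],
        "no_of_children": [0, 0, 2],
        "no_of_week_nights": [2, 1, 5],
        "no_of_weekend_nights": [1, 0, 2],
    }
)

# Hypothetical derived columns: party size and total length of stay.
df["no_of_family_members"] = df["no_of_adults"] + df["no_of_children"]
df["total_days"] = df["no_of_week_nights"] + df["no_of_weekend_nights"]
print(df[["no_of_family_members", "total_days"]])
```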

Repeated guests cancel less often, possibly because they are satisfied with the hotel's rooms.

the busiest months in the hotel

October is the busiest month. Cancellations are high in October and low in January.

price vs month

The price increases until September then decreases.

Outlier detection and treatment

Data Preprocessing

Creating training and test sets.
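A minimal sketch of this step, assuming one-hot encoding of the categorical columns and a stratified 70/30 split. The split ratio, the random seed and the toy rows are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative rows (not the real data).
df = pd.DataFrame(
    {
        "lead_time": [10, 200, 5, 150, 30, 300, 45, 90, 12, 250],
        "market_segment_type": ["Online", "Online", "Offline", "Online", "Corporate",
                                "Online", "Offline", "Online", "Corporate", "Online"],
        "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled", "Canceled",
                           "Not_Canceled", "Canceled", "Not_Canceled", "Canceled",
                           "Not_Canceled", "Canceled"],
    }
)

# Target: 1 = canceled, 0 = not canceled.
y = (df["booking_status"] == "Canceled").astype(int)

# One-hot encode categorical predictors; drop_first removes one redundant dummy.
X = pd.get_dummies(df.drop(columns=["booking_status"]), drop_first=True)

# Assumed settings: 70/30 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape)
```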

Building the model

The model can make wrong predictions in two ways:

Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.

Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking.

Both cases are important because:

If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs.

If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage the brand reputation.

How to reduce the losses?

The hotel would want the F1 score to be maximized: the greater the F1 score, the higher the chances of minimizing both False Negatives and False Positives.
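The F1 score is the harmonic mean of precision and recall, so it drops when either False Positives or False Negatives grow. A small sketch with hypothetical confusion-matrix counts (the helper `f1_from_counts` is written for this example):

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 cancellations caught, 20 false alarms, 40 missed.
score = f1_from_counts(tp=80, fp=20, fn=40)
print(round(score, 3))
```

With scikit-learn, `sklearn.metrics.f1_score(y_true, y_pred)` computes the same quantity directly from label arrays.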

Logistic Regression (with statsmodels library)

Observations

  1. A negative coefficient indicates that the probability of a customer canceling a booking decreases as the corresponding attribute value increases.

  2. A positive coefficient indicates that the probability of a customer canceling a booking increases as the corresponding attribute value increases.

  3. The p-value of a variable indicates whether the variable is significant. At a significance level of 0.05 (5%), any variable with a p-value less than 0.05 is considered significant.

  4. However, these variables might be multicollinear, which would affect the p-values.

  5. We will have to remove multicollinearity from the data to get reliable coefficients and p-values.

  6. There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).
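The variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The NumPy sketch below implements this definition on synthetic data. It illustrates the idea and is not the routine used in this project. statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor of each column of X (rows = observations)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        # Least-squares fit of column j on the remaining columns plus an intercept.
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        out[j] = 1.0 / (1.0 - r_squared)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so a and c are collinear

vif_values = vif(np.column_stack([a, b, c]))
print(vif_values.round(1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity, though the cutoff is a judgment call.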

Multicollinearity

market_segment_type exhibits high multicollinearity.

Removing market_segment_type_Offline

The multicollinearity has been treated.

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train3 as the final ones and lg5 as the final model.

Converting coefficients to odds
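A logistic regression coefficient is a change in log-odds, so exponentiating it gives the multiplicative change in the odds of cancellation for a one-unit increase in the feature. The coefficients below are hypothetical placeholders, not values from the fitted model.

```python
import math

# Hypothetical fitted logit coefficients (placeholders, not from this project's model).
coefficients = {"feature_a": -0.5, "feature_b": 0.25}

# exp(coef) is the factor by which the odds change per unit increase in the feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, odds_ratio in odds_ratios.items():
    pct_change = (odds_ratio - 1.0) * 100.0  # same effect expressed as a percentage
    print(f"{name}: odds ratio={odds_ratio:.3f}, change in odds={pct_change:+.1f}%")
```

An odds ratio below 1 means the odds of cancellation fall as the feature increases, and a ratio above 1 means they rise.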

Coefficient interpretations

Checking model performance on the training set

ROC-AUC

Model Performance Improvement

Checking model performance on training set

Let's use the Precision-Recall curve and see if we can find a better threshold.
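One way to read a threshold off the curve is to pick the point where precision and recall are closest to each other. This selection criterion is an assumption, and the labels and probabilities below are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted cancellation probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the arrays line up.
gap = np.abs(precision[:-1] - recall[:-1])
best_threshold = thresholds[np.argmin(gap)]

# Classify as canceled when the predicted probability reaches the threshold.
y_pred = (y_prob >= best_threshold).astype(int)
print(best_threshold)
```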

Checking model performance on training set

Model Performance Summary

Decision Tree Model

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Before pruning the tree let's check the important features.

Reducing overfitting

Pruning the tree

Pre-Pruning

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Cost Complexity Pruning

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
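A sketch of this procedure with scikit-learn, run on synthetic data rather than the bookings. The seed and data shape are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the hotel bookings).
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Effective alphas along the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Guard against tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# One tree per alpha; larger alphas prune more aggressively.
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    for alpha in ccp_alphas
]
node_counts = [clf.tree_.node_count for clf in clfs]
print(node_counts[0], node_counts[-1])
```

Scoring each tree in `clfs` on the training and test sets then gives the score-versus-alpha curves used to choose a value.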

F1 Score vs alpha for training and testing sets

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

With ccp_alpha=2.4753365544464932e-05, the decision tree is still complicated. To get a simplified tree, we use ccp_alpha=0.0015.

Actionable Insights and Recommendations

  1. Cancellation policies should account for the interplay of lead time, type of booking, number of special requests, average price per room, and number of weekend nights.

  2. When the lead time is less than 90 days, the booking is not made through the Online or Corporate segments, there is only one special request, and there are no weekend nights, the customer is predicted not to cancel when the average price per room is less than 178 euros and to cancel when it is greater than 178 euros.

  3. Therefore, the recommendation is that the hotel adjust the average price per room according to the lead time to avoid cancellations.

  4. To avoid cancellations or resource loss when the average price is greater than 178 euros, the management can consider a larger refund, if possible a full refund, to the customer.

  5. When at least one weekend night is booked, cancellations can happen when the average price is greater than 99 euros. The management can address this by reducing the price per room or increasing the refund amount.

  6. The number of special requests is also an important factor. If the hotel fulfills customers' requests, the chances of cancellation will be reduced.

  7. The logistic regression model shows that booking cancellations increase with the number of family members. The hotel management can address this by improving quality and adding facilities that attract both children and adults.

  8. Online booking is also a risk factor, as cancellations are highest in the Online segment. The hotel management needs to focus on this segment to reduce the loss of resources. Reducing the price and improving quality and facilities may help avoid cancellations.

  9. For repeated guests, the chances of cancellation decrease. By improving quality, the hotel can attract customers to return.

  10. The hotel can review meal plan 2, as its coefficient in the regression model is positive. Improving its quality may help avoid cancellations.

  11. The model indicates that bookings requiring a car parking space are less likely to be canceled. The hotel can expand and improve its car parking space.

  12. Arrival month is also an important factor. October, September, and August have higher cancellations. Reducing prices in these months may help the hotel avoid cancellations.